
<h1>Parse-O-Matic: Regular Expressions</h1>

<hl>Note: This is an appendix to the "Parse-O-Matic Scripts" user manual.</hl>

<h2>Table of Contents</h2>

You can click on any section title below to jump directly to that section.

   <a href="#OVERVIEW">Overview</a>
   <a href="#BASICREX">Basic Regular Expressions</a>
   <a href="#ASTERISK">Using the Asterisk</a>
   <a href="#ADVANCRE">Advanced Regular Expressions</a>

<a name="OVERVIEW"><h2>Overview</h2>

In the following list, the letters x, y and z stand in for any character.
<hl>
  ^xxx    Matches a sequence of characters at the start of a line
  xxx$    Matches a sequence of characters at the end of a line
  x.x     Matches any single character where the '.' appears
  [xz]    Matches a set of characters ('x' and 'z' in this example)
  [x-z]   Matches a range of characters (this example covers 'x' to 'z')
  x*      Matches zero or more occurrences of the preceding character
  [xyz]*  Matches zero or more occurrences from the preceding set
  [x-z]*  Matches zero or more occurrences from the preceding range
  [^xyz]  Matches any character but the ones specified
  [^x-z]  Matches any character but the ones in the specified range
</hl>
The backslash (\) character has a special meaning in regular expressions:
<hl>
  \x      Means "take the next character literally"
          For example:  \[  means the actual  [  character
          rather than the start of a set or range
  \t      Means "a tab character" (ASCII character 9)
</hl>
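The escape rules above can be tried out with a short sketch. It assumes that Python's re module behaves the same way as Parse-O-Matic for these particular patterns; the helper name found is made up for this illustration.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# \[ means the actual [ character, not the start of a set or range
bracket_found = found(r"\[", "item[3]")

# \t means a tab character (ASCII character 9)
tab_found = found(r"\t", "name\tvalue")

# [^x-z] needs one character outside the range x to z; "xyz" has none
outside_found = found(r"[^x-z]", "xyz")

print(bracket_found, tab_found, outside_found)  # True True False
```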
<a name="BASICREX"><h2>Basic Regular Expressions</h2>

Here are some examples of matches:
<hl>
  C.t        Matches Cat, Cot, Cut, Cxt, C3t etc.
  C[aou]t    Matches Cat, Cot, Cut only
  B..d       Matches Bird, Bred, Bead etc.
  ^Dog       Matches Dog only if it is at the beginning of a line
  Moose$     Matches Moose only if it is at the end of a line
  Pa*d       Matches Pd, Pad, Paad, Paaad etc.
</hl>
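The matches listed above can be checked with a small sketch. Python's re module is used here as a stand-in for Parse-O-Matic's own matcher (an assumption), and the names found and checks are hypothetical.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, sample text, expected result)
checks = [
    ("C.t",     "C3t",          True),
    ("C[aou]t", "Cot",          True),
    ("C[aou]t", "Cxt",          False),  # 'x' is not in the set
    ("B..d",    "Bird",         True),
    ("^Dog",    "Dog food",     True),
    ("^Dog",    "Hot Dog",      False),  # not at the start of the line
    ("Moose$",  "Big Moose",    True),
    ("Moose$",  "Moose tracks", False),  # not at the end of the line
    ("Pa*d",    "Pd",           True),   # zero occurrences of 'a'
    ("Pa*d",    "Paaad",        True),
]

results = [found(pattern, text) for pattern, text, _ in checks]
for (pattern, text, expected), result in zip(checks, results):
    print(pattern, repr(text), result)
```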
<a name="ASTERISK"><h2>Using the Asterisk</h2>

The last example given above uses the * character to indicate zero, one or more occurrences of a particular character (in this case, the letter 'a'). This is <b>different</b> from the way the Windows operating system uses the * wildcard character. In Windows, the * wildcard matches "any sequence of characters", while the ? wildcard matches "any single character".

In regular expressions, however, the asterisk applies only to the item that comes immediately before it. That is why 'Pa*d' would not match 'Parsed'; the asterisk means "match zero or more of the preceding character specification", so only the letter 'a' may be repeated between 'P' and 'd'.
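The contrast can be seen in a brief sketch. It assumes that Python's fnmatch module is a fair stand-in for a file-style wildcard, in which * stands for any run of characters, and that Python's re module is a fair stand-in for Parse-O-Matic's matcher; neither is part of Parse-O-Matic itself.

```python
import fnmatch
import re

# File-style wildcard: * stands for any run of characters
wildcard_hit = fnmatch.fnmatchcase("Parsed", "Pa*d")

# Regular expression: * repeats only the preceding 'a'
regex_hit_parsed = re.fullmatch("Pa*d", "Parsed") is not None
regex_hit_paaad = re.fullmatch("Pa*d", "Paaad") is not None

print(wildcard_hit, regex_hit_parsed, regex_hit_paaad)  # True False True
```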

If you actually want to search for 'Pa' followed by one or more letters and then 'd', the correct syntax is:

<hl>  Pa[a-z][a-z]*d</hl>

This means that we want to match 'Pa', then a letter in the range from 'a' to 'z', then some number (including zero) of characters in the 'a' to 'z' range, and finally the letter 'd'. The character string 'Parsed' would meet these criteria, as would 'Paid' and 'Packed'. The string 'Pad' would not, because no letter comes between 'Pa' and the final 'd'.
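As a rough check of this pattern, the sketch below tests several whole words against it. It assumes Python's re module as a stand-in matcher; re.fullmatch is used so that each word must match from its first character to its last.

```python
import re

pattern = "Pa[a-z][a-z]*d"
words = ["Parsed", "Paid", "Packed", "Pad", "Pd"]

# Map each word to True or False depending on whether the whole word matches
whole_word_hits = {word: re.fullmatch(pattern, word) is not None for word in words}

for word, hit in whole_word_hits.items():
    print(word, hit)
```

'Pad' and 'Pd' fail because the pattern requires at least one letter between 'Pa' and the final 'd'.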

<a name="ADVANCRE"><h2>Advanced Regular Expressions</h2>

Here are some more complicated examples of regular expressions:
<hl>
  C[^ou]t        Matches Cat, Cxt and so on, but not Cot or Cut
  C[ao]*t        Matches Ct, Cat, Caat, Cot, Coot, Cooot, Coat, Coaoat etc.
  [0-9][0-9]*    Matches numbers such as 0, 1, 01, 10, 25, 0990, 9999 etc.
  -[0-9][0-9]*   Matches negative numbers such as -0, -1, -19, -12345 etc.
</hl>
In the last example, [0-9] is specified twice to ensure that at least one digit is found. Bear in mind that the * character means "zero or more occurrences". If you had only specified '-[0-9]*' you would get a spurious match within the string 'Hello - there', since the '-' character is indeed found, followed by <b>zero</b> occurrences of the digits 0 through 9.
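The spurious match is easy to reproduce. This sketch again assumes Python's re module as a stand-in for Parse-O-Matic's matcher; the variable names are made up for the illustration.

```python
import re

text = "Hello - there"

# Zero digits are allowed, so the lone '-' counts as a match
loose = re.search("-[0-9]*", text)

# At least one digit is required, so nothing in this text matches
strict = re.search("-[0-9][0-9]*", text)

# A real negative number is still found by the stricter pattern
negative = re.search("-[0-9][0-9]*", "Balance: -19")

print(loose.group())     # -
print(strict)            # None
print(negative.group())  # -19
```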

You can create fairly complex patterns using regular expressions. Consider this example:

<hl>  \$[0-9][0-9]*\.[0-9][0-9]</hl>

This would match dollar amounts with two decimal places, such as $0.00, $03.23, $3.14, $9.99, $1234.56 and so on.
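Here is a short sketch that pulls dollar amounts out of a line of text with this pattern. It assumes Python's re module as a stand-in matcher, and the sample line is invented for the illustration.

```python
import re

pattern = r"\$[0-9][0-9]*\.[0-9][0-9]"
line = "Coffee $3.14, lunch $03.23, rent $1234.56, tip $5"

# findall returns every non-overlapping match, from left to right
amounts = re.findall(pattern, line)

print(amounts)  # ['$3.14', '$03.23', '$1234.56']
```

The final '$5' is skipped because it has no decimal point followed by two digits.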